Big data has reshaped the way organizations manage, examine, and leverage data in every industry. To safeguard sensitive data from public breaches, several countries have investigated this issue and introduced privacy protection mechanisms. However, existing approaches based on quasi-identifiers do not preserve privacy to a sufficient extent. This paper proposes a method called evolutionary tree-based quasi-identifier and federated gradient (ETQI-FD) for privacy preservation over big healthcare data. The first step of ETQI-FD is learning quasi-identifiers. Applying an information loss function separately to categorical and numerical attributes achieves both the largest dissimilarity and partitioning without a comprehensive exploration of tuples of features or attributes. Next, with the learnt quasi-identifiers, the privacy of data items is preserved by applying a federated gradient arbitrary privacy preservation learning model, which attains a balance between privacy and accuracy. In this model, we evaluate the determinant of each attribute to the outputs. Then, by injecting adaptive Lorentz noise into the data attributes, ETQI-FD minimizes the influence of noise on the final results, thereby contributing to both privacy and accuracy. An experimental evaluation shows that the ETQI-FD method achieves better accuracy and privacy than the existing methods.
1. INTRODUCTION

In recent years, data volume has grown exponentially, and an increasing share of this data contains personal information. This sensitive data has attracted the attention of those interested in producing more customized and personalized applications. This in turn infringes on individual privacy and raises concerns that personal data may be breached and falsified. As a consequence, safeguarding data privacy has become a key issue in privacy-preserving healthcare data.

An attribute centric anonymization scheme was proposed in [1] to safeguard against identity disclosure even in the presence of malicious users possessing a certain amount of background knowledge. However, the execution time and memory incurred in anonymization received little attention. A robust anonymization and risk assessment scheme was designed in [2] that achieved four different objectives for biomedical data. Despite the minimum execution time required for anonymization, a tradeoff between privacy and accuracy remained. Best seed values were identified in [3] to minimize the information loss. However, learning from outliers and imbalanced data is still one of the major drawbacks for privacy preservation. Among the several strategies presented to solve this issue, data preprocessing solutions [4], [5] are found to be effective and easy to implement. Privacy preservation methods were discussed in [6], [7].

State-of-the-art security and privacy challenges were discussed in detail in [8]. A multi-label ensemble classification approach including decision tree algorithms was designed in [9]. A survey on user privacy was presented in [10]. A double decryption algorithm equipping public key encryption with differential privacy was proposed in [11], thereby contributing to security. A data warehouse solution called Hive was presented in [12] using nearest similarity based clustering. A convolutional neural network (CNN) was customized for preserving privacy via mapping [13] and deep learning [14] for recording electronic health sequences. However, the attributes considered in all of these works have features that are both sensitive and quasi attributes in general. A separate anonymization and reconstruction algorithm was designed in [15] using a real dataset. Rao et al. [16] surveyed privacy preservation techniques for big data. Temuujin et al. [17] proposed an l-diversity algorithm along with a cuckoo filter to enhance data processing efficiency. However, information loss was not addressed.

A two-step clustering method was designed in [18] with the aid of equivalence classes to reduce the information loss of anonymous datasets. Despite minimum information loss, a tradeoff between data privacy and quality was identified. To address this issue, a conditional probability distribution along with Gibbs sampling was proposed in [19], thereby retaining better data utility. A recommender system was designed in [20] with the objective of minimizing the query response. A new conjugate gradient (CG) method was introduced in [21] to solve optimization problems. Clustering methods were developed in [22] with the aid of log data. Metaheuristic optimization in a neural network model was introduced in [23] for time series modeling. A hybrid algorithm was developed in [24] for increasing security in e-business systems. Data-mining classification algorithms were introduced in [25] to diagnose lung and breast cancer. A new approach to decision-making based on the characterization of cognitive tasks was developed in [26]. Big data privacy models were introduced by means of data masking methods.

The major issue identified is that most existing privacy preservation mechanisms try to optimize a single objective, either minimizing the information loss incurred during identification of quasi identifiers or enhancing the privacy preservation accuracy. However, single-objective optimization may reduce the significance and efficiency of healthcare data in general. In addition, the learning aspects involved in privacy preservation have received less attention. Preserving the privacy of big healthcare data with minimum information loss, minimum communication overhead, and higher privacy preservation accuracy is therefore a challenging task. To overcome this issue, evolutionary tree-based quasi-identifier and federated gradient (ETQI-FD) is developed for privacy preservation involving big healthcare data.

In this paper, we design an evolutionary tree-based indexed quasi identification algorithm. Here, the numerical and categorical attributes are not merged into a single data node, which may minimize the communication overhead involved in identification of quasi identifiers. With the deployment of the federated adaptive Lorentz privacy preservation algorithm, the proposed ETQI-FD method also minimizes the information loss involved in privacy preservation. An arbitrary privacy-preserving adaptation (APA) function is used to enhance the accuracy. The main novelty and contributions of the proposed method are summarized as follows:
- The ETQI-FD method is introduced for finding the optimal quasi identifiers and thereby preserving the privacy of healthcare data. This contribution is achieved with the novelty of the ETQI model and the federated adaptive Lorentz privacy preservation algorithm.
- In contrast to existing works, the ETQI-FD method introduces the evolutionary tree-based indexed quasi identification model to obtain the quasi identifiers for big healthcare data with lower execution time. The information loss function is employed independently for categorical and numerical attributes to map between sample sets and attribute values. The generalization and suppression process is carried out for numerical and categorical attributes through the information loss function, thereby minimizing the communication overhead. Next, an anonymization process is carried out to reduce the communication overhead via quasi identifier identification.
- In order to learn the features for preserving the privacy of big healthcare data via the identified quasi identifiers, the ETQI-FD method introduces the federated adaptive Lorentz privacy preservation algorithm. First, a gradient descent function is used to perform linear regression over the quasi identifiers of all patients. Second, the APA function is utilized to improve the accuracy. Third, by injecting the adaptive Lorentz distribution (ALD), privacy is preserved based on a threshold value. This helps to improve the accuracy and privacy of each patient's big healthcare data.
The rest of the paper is organized as follows. Section 2 presents the research method, in which the proposed ETQI-FD method is described. Section 3 gives the experimental settings. Section 4 analyses and discusses the performance of the proposed method. Finally, the conclusion is presented in Section 5.


2. RESEARCH METHOD 

Privacy preservation is essential when users' data are provided to a third party for processing or any other distinct objective. With the upsurge of big data in recent years, several health and medical institutions have collected huge amounts of medical data. As a result, safeguarding the private big healthcare data of an individual has become a paramount research topic. In this work, we propose a method called ETQI-FD for privacy preservation over big healthcare data. Figure 1 shows the block diagram of the ETQI-FD method. As shown in Figure 1, we first formulate the ETQI model in detail. Based on the quasi identifiers learnt, a privacy preservation mechanism using a federated gradient arbitrary learning model is proposed. A detailed description of the ETQI-FD method follows.


[Diagram blocks: Diabetes 130-US hospitals dataset; Evolutionary tree; Quasi identifier; Federated gradient learning; Big healthcare privacy preservation]

Figure 1. Block diagram of evolutionary tree-based quasi identifier and federated gradient


2.1. Evolutionary tree-based indexed quasi identification model 

In this section, we outline the evolutionary tree-based indexed quasi identification model, which learns the quasi identifiers with minimum execution time required for anonymization. First, the large-volume Diabetes 130-US hospitals dataset is taken as input. Figure 2 shows the block diagram of the quasi identification process.


[Diagram blocks: Big data dataset 'D'; Mapping between samples set and attribute value; Information loss (numerical attribute); Information loss (categorical attribute); Optimized and anonymized quasi identifier]

Figure 2. Block diagram of evolutionary tree-based indexed quasi identification process


Figure 2 shows the evolutionary tree-based indexed quasi identification model. The big dataset is taken as input. An information loss function is used separately for categorical and numerical attributes to map between sample sets and attribute values. As given in Figure 2, consider Table 1 with columns $a_1, a_2, \ldots, a_n$ and rows $r_1, r_2, \ldots, r_p$, where each column name represents a feature or attribute and each row represents an instance or record. Let $A$ denote the set of attributes and $S$ the set of samples. To each sample $s$ corresponds a tuple $(v_1, v_2, \ldots, v_n)$, where $v_i$ is the value of the attribute $a_i$ for that sample. Let $V_a$ denote the set of all values of the feature or attribute $a$. Then, a function $f$ is defined that maps $S$ to $V$ as in (1).

$$f(s) = [a_1(s), a_2(s), \ldots, a_n(s)] \tag{1}$$

From (1), the function $f$ for the sample $s$ is derived from the attribute values $a_i(s)$ of that sample. With the above function, the four-tuple used in our work is represented as $(S, A, V, f)$. Table 1 presents this information for the diabetes 130-US hospitals dataset, where $V$ and $f$ are omitted and only $(S, A)$ is shown.


Table 1. Example of diabetes 130-US hospitals dataset

Patient number | Race             | Age | Gender | Time in hospital | HbA1c
1              | African American | 15  | Male   | 11               | None
2              | Other            | 25  | Male   | 13               | >7
3              | African American | 40  | Female | 21               | >7
4              | Other            | 65  | Female | 22               | >6
5              | African American | 60  | Male   | 16               | None


From Table 1, the set of samples is $S = \{1, 2, 3, 4, \ldots, 50\}$ and the attribute set is $A = \{\text{Patient number}, \text{Race}, \text{Age}, \text{Gender}, \text{Time in hospital}, \text{HbA1c}, \ldots\}$. For the five rows shown, $V_{\text{Patient number}} = 1, 2, 3, 4, 5$; $V_{\text{Race}} = $ African American, Other, African American, Other, African American; $V_{\text{Age}} = 15, 25, 40, 65, 60$; $V_{\text{Gender}} = $ Male, Male, Female, Female, Male; $V_{\text{Time in hospital}} = 11, 13, 21, 22, 16$; and $V_{\text{HbA1c}} = $ None, >7, >7, >6, None. Then, for the first record (patient number 1), the value of $f$ is $f(1) = [\text{African American}, 15, \text{Male}, 11, \text{None}]$, therefore $\text{Race}(1) = \text{African American}$, $\text{Age}(1) = 15$, $\text{Gender}(1) = \text{Male}$, $\text{Time in hospital}(1) = 11$, and $\text{HbA1c}(1) = \text{None}$. Let $ET$ represent the evolutionary tree, as illustrated in Figure 3 for 'Time in hospital'.
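The mapping in (1) can be illustrated with a short Python sketch. It assumes the five rows of Table 1 are held in a dictionary keyed by patient number; the names `TABLE_1`, `ATTRIBUTES`, `f`, and `values_of` are chosen for this illustration only and are not part of the proposed method.

```python
# Rows of Table 1, keyed by patient number (the sample identifier s).
ATTRIBUTES = ["race", "age", "gender", "time_in_hospital", "hba1c"]

TABLE_1 = {
    1: {"race": "African American", "age": 15, "gender": "Male",
        "time_in_hospital": 11, "hba1c": "None"},
    2: {"race": "Other", "age": 25, "gender": "Male",
        "time_in_hospital": 13, "hba1c": ">7"},
    3: {"race": "African American", "age": 40, "gender": "Female",
        "time_in_hospital": 21, "hba1c": ">7"},
    4: {"race": "Other", "age": 65, "gender": "Female",
        "time_in_hospital": 22, "hba1c": ">6"},
    5: {"race": "African American", "age": 60, "gender": "Male",
        "time_in_hospital": 16, "hba1c": "None"},
}


def f(s):
    """Return [a_1(s), ..., a_n(s)]: the attribute values of sample s."""
    row = TABLE_1[s]
    return [row[a] for a in ATTRIBUTES]


def values_of(attribute):
    """Return the values taken by one attribute, in row order."""
    return [TABLE_1[s][attribute] for s in sorted(TABLE_1)]


print(f(1))
print(values_of("time_in_hospital"))
```

Here `f(1)` reproduces the tuple given for the first record, and `values_of` lists one column of the table.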

The evolutionary tree given in Figure 3 is used in our work to generalize the value of each categorical and numerical attribute. A $\gamma$-dissimilar quasi identifier is a subset of attributes that becomes a key when at most a ratio $1 - \gamma$ of samples is discarded. Moreover, a subset of attributes partitions two samples $s_1$ and $s_2$ if $s_1$ and $s_2$ have different values for at least one attribute of that subset. A $\gamma$-separation quasi identifier is a subset of attributes that separates at least a ratio $\gamma$ of all possible sample pairs.
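The two ratios in these definitions can be computed directly. The sketch below is one reading of them, not the authors' implementation: the dissimilarity ratio is taken as the fraction of records that remain once duplicates on the chosen attributes are removed, and the separation ratio as the fraction of record pairs that differ on at least one chosen attribute. The function names and the two-attribute projection of Table 1 are assumptions made for illustration.

```python
from itertools import combinations

# Race and gender columns of the five rows in Table 1.
RECORDS = [
    {"race": "African American", "gender": "Male"},
    {"race": "Other", "gender": "Male"},
    {"race": "African American", "gender": "Female"},
    {"race": "Other", "gender": "Female"},
    {"race": "African American", "gender": "Male"},
]


def projection(record, attrs):
    """Values of one record restricted to the chosen attributes."""
    return tuple(record[a] for a in attrs)


def dissimilarity_ratio(records, attrs):
    """Fraction of records kept after removing duplicates on attrs."""
    unique = {projection(r, attrs) for r in records}
    return len(unique) / len(records)


def separation_ratio(records, attrs):
    """Fraction of record pairs that differ on at least one attribute."""
    pairs = list(combinations(records, 2))
    separated = sum(
        1 for r1, r2 in pairs
        if projection(r1, attrs) != projection(r2, attrs)
    )
    return separated / len(pairs)


attrs = ("race", "gender")
print(dissimilarity_ratio(RECORDS, attrs))  # 0.8
print(separation_ratio(RECORDS, attrs))     # 0.9
```

On these rows only patients 1 and 5 share the same race and gender, so four of five records stay distinct and nine of ten pairs are separated.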


[Diagram text: Train using linear regression model; Transmission of initial regression value to different patients; Patients train regression locally using APA function with their own data items; Server pools regression results using ALD without accessing data]

Figure 3. Evolutionary tree of time in hospital
Let $\gamma = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, where $\alpha$ symbolizes a group. Then, the information loss $IL_{num}$ for the numerical attribute representation of a given dataset $D$ through generalization and suppression is mathematically evaluated as in (2).

$$IL_{num} = |\gamma| \sum_{i} \frac{G_{i_h} - G_{i_l}}{T_{i_h} - T_{i_l}} \tag{2}$$

From (2), the information loss for numerical attributes $IL_{num}$ is evaluated based on the highest value $G_{i_h}$ and the least value $G_{i_l}$ of the tuples in the group, and the highest value $T_{i_h}$ and the least value $T_{i_l}$ of the tuples in dataset $D$. In a similar manner, the information loss for categorical attributes $IL_{cat}$ is evaluated as shown in (3), based on the height of the column values $H(G_j)$ and the height of the evolutionary tree of the column $H(ET(G_j))$.

$$IL_{cat} = |\gamma| \sum_{j} \frac{H(G_j)}{H(ET(G_j))} \tag{3}$$


Finally, the quasi identifiers are obtained from the information loss of the numerical attributes $IL_{num}$ and the information loss of the categorical attributes $IL_{cat}$ as in (4).

$$QID = a \in \mathrm{Index}\left[A \mid \gamma_a(IL_{num}) = \gamma_a(IL_{cat})\right] \tag{4}$$
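The two information loss terms can be sketched in Python for a single attribute. This is a minimal illustration under stated assumptions: (2) is read as the group size times the group's value range divided by the dataset's value range, and (3) as the group size times the ratio of the generalization subtree height to the full tree height. The function names, the example group, and the tree heights are hypothetical.

```python
def il_numerical(group_values, dataset_values):
    """Group size times (group range / dataset range) for one attribute."""
    group_range = max(group_values) - min(group_values)
    dataset_range = max(dataset_values) - min(dataset_values)
    return len(group_values) * group_range / dataset_range


def il_categorical(group_size, subtree_height, tree_height):
    """Group size times (subtree height / evolutionary tree height)."""
    return group_size * subtree_height / tree_height


# 'Time in hospital' column of Table 1 and one hypothetical group of three rows.
time_in_hospital = [11, 13, 21, 22, 16]
group = [11, 13, 16]

loss_num = il_numerical(group, time_in_hospital)   # 3 * (16 - 11) / (22 - 11)
loss_cat = il_categorical(group_size=3, subtree_height=1, tree_height=2)

print(round(loss_num, 4))
print(loss_cat)
```

For a group with several attributes, the per-attribute terms would be summed as in (2) and (3).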


The pseudo code representation of evolutionary tree-based indexed quasi identification is given as follows.

Algorithm 1. Evolutionary tree-based indexed quasi identification
Input: patients $P = P_1, P_2, \ldots, P_n$; big data dataset $DS$; attributes $a_1, a_2, \ldots, a_m$
Output: optimized and anonymized quasi identifiers
Begin
  For each big data dataset $DS$ with $n$ attributes $A = a_1, a_2, \ldots, a_n$ and patients $P$
    For each function $f$ defined that maps $S$ to $V$ as given in (1)
      Evaluate information loss for numerical attributes using (2)
      Evaluate information loss for categorical attributes using (3)
      Estimate quasi identifiers using (4)
    End for
  End for
End


As given in the evolutionary tree-based indexed quasi identification algorithm, the objective is to learn the quasi identifiers in a timely manner with minimum communication overhead while maintaining the links within the evolutionary tree. This is achieved by first performing a mapping function based on two tuples. Next, the generalization and suppression process is performed via the information loss function, which reduces the execution time of the anonymization process.

Finally, by applying the index function to the attribute information loss, the numerical and categorical attributes are not merged into a single data node; instead, one data node is merged locally while the rest are associated with the representative data node. In this manner, a significant amount of communication overhead is avoided while performing anonymization during quasi identifier identification.


2.2. Federated gradient arbitrary privacy preservation learning model 

Nowadays, patient privacy is a critical concern in big healthcare data. However, conventional machine learning techniques that depend purely on patients' log files and behavioral aspects are not adequate to preserve it. Hence, healthcare data security should take supplementary information into account to safeguard patients' data.

In this work, federated learning is implemented as the privacy preservation algorithm. It incorporates a collaborative learning model that is coordinated centrally without the necessity of uploading the local datasets to one server. This enables robust machine learning while addressing issues such as data privacy.

The privacy-preserving federated machine learning process is designed on the concept of differential privacy. Consider a dataset $DS$ in which the quasi identifiers $QI$ of a set of $n$ patients are stored. This database can be queried by authorized users, among which there may be several malicious users trying to analyze the data. Let $q_1, q_2, \ldots, q_n$ be their queries and $ans_1, ans_2, \ldots, ans_n$ the answers to these queries.

The main idea behind differential privacy is to answer queries in such a way that it is impractical to distinguish the existence or non-existence of any individual's information. Thus, if two databases $DS = \{QI_1, QI_2, \ldots, QI_s, \ldots, QI_n\}$ and $DS' = \{QI_1, QI_2, \ldots, QI'_s, \ldots, QI_n\}$ differ by only one record, then the probability distributions $Prob(DS)$ and $Prob(DS')$ must be very close to each other. Figure 4 shows the block diagram of the federated gradient arbitrary privacy preservation learning model.
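For reference, the closeness requirement on the two distributions is usually formalized as $\varepsilon$-differential privacy. The statement below is the standard textbook definition rather than a formula taken from this paper; the randomized mechanism $M$, the output set $O$, and the privacy budget $\varepsilon$ are symbols introduced here for illustration.

```latex
\Pr\left[ M(DS) \in O \right] \;\le\; e^{\varepsilon} \, \Pr\left[ M(DS') \in O \right]
\quad \text{for all } DS, DS' \text{ differing in one record and all } O \subseteq \mathrm{Range}(M)
```

A smaller $\varepsilon$ forces the two output distributions closer together and therefore gives stronger privacy at the cost of noisier answers.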

As shown in Figure 4, each patient $P_i$ owns $n$ data items $(P_i, Q_i)$, where $i \in [1, n]$. Each data item is first initialized with $w$ attributes and $m$ labels, i.e., $p_{i1}, p_{i2}, \ldots, p_{iw}, q_{i1}, q_{i2}, \ldots, q_{im}$, based on linear regression. Then, to optimize the learning process, a gradient descent function is utilized and mathematically expressed as in (5).

$$\nabla g(D_i^p, w_i^n) = \nabla (IL_{num} + IL_{cat}) \left[ Q_i - f(D_i^p, w_i^n) \right] \tag{5}$$

From (5), the gradient descent function $g(\cdot)$ with the set of data items $D_i^p$ and the weight matrix $w_i^n$ of patient $i$ after $n$ iterations is obtained from the derivative of the loss functions of the numerical attributes $IL_{num}$ and the categorical attributes $IL_{cat}$, and the extent of variability between the predicted value $f(D_i^p, w_i^n)$ and the actual value $Q_i$. Next, the fraction of local nodes (patients) selected to undergo training is configured by upgrading the weight, formulated as in (6).

$$w_i^{n+1} = w_i^n - \eta_i \nabla g(D_i^p, w_i^n) \tag{6}$$
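The weight upgrade in (6) is an ordinary gradient descent step, which the following Python sketch illustrates for one patient's local data. It is a simplification under stated assumptions: a plain mean squared error loss for linear regression stands in for the gradient in (5), whose information loss terms are not reproduced here, and the toy data, learning rate, and function names are illustrative only.

```python
def predict(weights, features):
    """Linear model f(D, w): dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def gradient(weights, rows, targets):
    """Gradient of the mean squared error over one patient's local data."""
    n = len(rows)
    grad = [0.0] * len(weights)
    for features, target in zip(rows, targets):
        error = predict(weights, features) - target
        for k, x in enumerate(features):
            grad[k] += 2.0 * error * x / n
    return grad


def local_update(weights, rows, targets, learning_rate):
    """One step of w <- w - eta * grad g(D, w)."""
    grad = gradient(weights, rows, targets)
    return [w - learning_rate * g for w, g in zip(weights, grad)]


# Toy local data for one patient: target = 2 * x (illustrative values only).
rows = [[1.0], [2.0], [3.0]]
targets = [2.0, 4.0, 6.0]

weights = [0.0]
for _ in range(200):
    weights = local_update(weights, rows, targets, learning_rate=0.05)

print(weights)
```

In a federated setting each patient would run `local_update` on its own rows and send only the resulting weights to the server, so the raw data never leaves the local node.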


Figure 4. Block diagram of federated gradient arbitrary privacy preservation learning model 


From (6), the upgraded weight $w_i^{n+1}$ for each patient $P_i$ is obtained from the existing weight $w_i^n$ and a learning factor $\eta_i$. Next, with the configured resultant value, an arbitrary privacy-preserving adaptation (APA) function is applied with the objective of improving the accuracy of big healthcare data. For this, the determinant of each attribute is evaluated. By extracting the determinant of the same attribute or feature from each tuple, the mean determinant of every attribute or feature $D_j(p_i)$ to the output is evaluated as in (7).

$$D_j(p_i) = \frac{1}{n} \sum_{i=1}^{n} D_{p_{ij}}(p_i), \quad j \in [1, p] \tag{7}$$


With the determinant value obtained from (7), two adaptation components $c_1$ and $c_2$ are introduced, where $c_1 \in [0,1]$ and $c_2 \in [0,1]$. Here $c_1$ denotes a threshold that decides whether the determinant of an attribute to the output is high or low, and its value is defined by the patients. In other words, an attribute whose mean determinant exceeds the threshold $c_1$ possesses greater determination of the output, and ALD (Lorentz) noise is injected into all such attributes. On the other hand, when the determinant value is less than the threshold $c_1$, the original data is kept with probability $1 - prob$ and adaptive Lorentz distribution noise is injected with probability $prob$. This is mathematically expressed as in (8).

$$\tilde{p}_{ij} = \begin{cases} \hat{p}_{ij}, & \alpha \ge c_1 \\ \bar{p}_{ij}, & \alpha < c_1 \end{cases} \tag{8}$$

From (8), $\alpha$ refers to the ratio of determinant results, $\alpha = D_j / \sum_{j=1}^{p} D_{ij}$. When $\alpha < c_1$, the mathematical representation is formulated as in (9).

$$\bar{p}_{ij} = \begin{cases} \hat{p}_{ij}, & \text{with probability } prob \\ p_{ij}, & \text{with probability } 1 - prob \end{cases} \tag{9}$$

With the obtained probability measures, a small amount of adaptive Lorentz distribution noise is injected into the attributes as in (10).

$$\hat{p}_{ij} = p_{ij} + \frac{1}{D_j} f[p; p_0, \gamma] \tag{10}$$


From (10), $f[p; p_0, \gamma]$ represents the ALD or Lorentz noise, which is estimated as in (11).

$$f[p; p_0, \gamma] = \frac{1}{\pi\gamma\left[1 + \left(\frac{p - p_0}{\gamma}\right)^2\right]} = \frac{1}{\pi}\left[\frac{\gamma}{(p - p_0)^2 + \gamma^2}\right] \tag{11}$$

From (11), $p_0$ specifies the location parameter and $\gamma$ represents the scaling factor, with the maximum value $\frac{1}{\pi\gamma}$ attained at $p = p_0$ for each patient. The pseudo code representation of federated adaptive Lorentz privacy preservation is given as follows.


Algorithm 2. Federated adaptive Lorentz privacy preservation
Input: patients $P = P_1, P_2, \ldots, P_n$; big data dataset $DS$; attributes $a_1, a_2, \ldots, a_m$; quasi identifiers $QI = QI_1, QI_2, \ldots, QI_s, \ldots, QI_n$
Output: accurate and adaptive privacy preserved identifiers
Initialize fraction of local nodes or patients $p_{i1}, p_{i2}, \ldots, p_{iw}, q_{i1}, q_{i2}, \ldots, q_{im}$
Initialize $c_1$ and $c_2$
Begin
  For each big data dataset $DS$ with attributes $A$
    For each quasi identifier $QI = QI_1, QI_2, \ldots, QI_s, \ldots, QI_n$
      Perform linear regression based on gradient descent function using (5)
      // configuration
      Upgrade weight using (6) for each patient $P_i$
      Evaluate mean determinant of every attribute using (7)
      Estimate adaptive Lorentz distribution using (8), (9) and (10)
      Return (privacy preserved data items)
    End for
  End for
End
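The Lorentz noise used in Algorithm 2 follows the Lorentz (Cauchy) density, which can be evaluated and sampled with the Python standard library alone. The sketch below assumes the standard form of that density with location $p_0$ and scale $\gamma$; the function names, the seed, and the parameter values are illustrative assumptions, and how the paper adapts the scale per attribute is not reproduced.

```python
import math
import random


def lorentz_pdf(p, p0, gamma):
    """Lorentz (Cauchy) density with location p0 and scale gamma."""
    return gamma / (math.pi * ((p - p0) ** 2 + gamma ** 2))


def lorentz_noise(p0, gamma, rng):
    """Draw one Lorentz-distributed value by inverting the CDF."""
    u = rng.random()
    return p0 + gamma * math.tan(math.pi * (u - 0.5))


rng = random.Random(0)                    # fixed seed, for repeatability only
peak = lorentz_pdf(0.0, 0.0, 0.5)         # density at p = p0
sample = lorentz_noise(0.0, 0.5, rng)

print(peak, 1 / (math.pi * 0.5))
print(sample)
```

The density peaks at $p = p_0$ with value $1/(\pi\gamma)$, so a larger scale spreads the noise more widely. The Lorentz distribution has heavy tails, so occasional very large noise values should be expected.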


As given in the federated adaptive Lorentz privacy preservation algorithm, the objective is to obtain accurate and adaptive privacy-preserved identifiers. First, linear regression based on the gradient descent function is performed for each patient with the obtained quasi identifiers. Next, for each patient, the weight is upgraded and the APA function is utilized, thereby contributing to accuracy. Finally, privacy is preserved for each quasi identifier by injecting the adaptive Lorentz distribution according to the threshold value. In this way, accuracy and privacy are ensured for each patient's big healthcare data.
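The threshold rule described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the per-attribute scores stand in for the mean determinants, the share of each score in the total plays the role of the ratio compared against the threshold, and the names `perturb`, `c1`, `prob`, and `gamma` as function parameters are assumptions. Attributes whose share reaches the threshold always receive noise; the others receive noise only with probability `prob`.

```python
import math
import random


def lorentz_noise(gamma, rng):
    """Zero-centred Lorentz (Cauchy) noise drawn by inverting the CDF."""
    return gamma * math.tan(math.pi * (rng.random() - 0.5))


def perturb(values, scores, c1, prob, gamma, rng):
    """Inject noise per attribute according to the threshold rule.

    scores must have a non-zero sum; each score's share of that sum is
    compared against the threshold c1.
    """
    total = sum(scores)
    result = []
    for value, score in zip(values, scores):
        share = score / total
        if share >= c1 or rng.random() < prob:
            result.append(value + lorentz_noise(gamma, rng))
        else:
            result.append(value)
    return result


values = [11.0, 40.0]          # hypothetical attribute values of one data item
scores = [0.7, 0.3]            # hypothetical per-attribute scores

# Threshold above every share and prob = 0: nothing is perturbed.
untouched = perturb(values, scores, c1=1.1, prob=0.0, gamma=0.5,
                    rng=random.Random(0))
# Threshold of 0: every attribute is perturbed.
noisy = perturb(values, scores, c1=0.0, prob=0.0, gamma=0.5,
                rng=random.Random(0))

print(untouched)
print(noisy)
```

Raising `c1` or lowering `prob` leaves more attributes untouched, which favours accuracy; the opposite settings favour privacy.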


3. EXPERIMENTAL SETTINGS 

In this section, the experimental setup used to evaluate the performance of ETQI-FD for privacy preservation of big healthcare data via quasi identifiers is presented. The efficiency of the ETQI-FD method is measured with metrics such as execution time, communication overhead, accuracy, and information loss using the diabetes 130-US hospitals dataset. The privacy preserving experiments on this dataset are implemented in Python. The implementation runs on a Windows 10 operating system with a Core i3-4130 3.40 GHz processor, 4 GB RAM, a 1 TB (1000 GB) hard disk, an ASUSTek P5G41C-M motherboard, and internet protocol. For the experimental evaluation, ETQI-FD considers between 500 and 5000 patient records from the diabetes 130-US hospitals dataset.
4. RESULTS AND DISCUSSION
4.1. Analysis of communicational overhead 

The communicational overhead refers to the overhead incurred in maintaining links while designing the evolutionary tree. This is mathematically expressed as in (12).

$$CO = \sum_{i=1}^{n} P_i \times MEM[QID] \tag{12}$$

From (12), the communicational overhead $CO$ is measured based on the number of patients involved in the simulation process $P_i$ and the memory consumed during the identification of quasi identifiers $MEM[QID]$. It is measured in kilobytes (KB). Paired tests compare the communication overhead produced by the algorithms on privacy preservation tasks. The comparison of the communication overhead for the proposed ETQI-FD method and the existing attribute centric anonymization [1] and robust anonymization and risk assessment [2] is graphically depicted in Figure 5.

The experimental results for communication overhead on the diabetes 130-US hospitals dataset are depicted in Figure 5. In the experiments, the number of patients provided as input ranged from 500 to 5000. For a simulation involving 500 patients, the memory consumed per patient while designing the evolutionary tree for identification of quasi identifiers was 2 KB using ETQI-FD, 3 KB using [1], and 4 KB using [2], so the overall communication overhead was 1000 KB, 1500 KB, and 2000 KB respectively. The improvement is owing to the application of the evolutionary tree-based indexed quasi identification algorithm in the proposed ETQI-FD. Therefore, the communication overhead involved during privacy preservation is reduced. On average, the communication overhead of ETQI-FD is reduced by 32% compared to [1] and by 49% compared to [2].
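The arithmetic behind the 500-patient example can be checked with a few lines of Python. The sketch assumes every patient incurs the same per-patient memory cost, so the sum in (12) reduces to a product; the function names are chosen for this illustration.

```python
def communication_overhead(num_patients, mem_qid_kb):
    """CO in KB when every patient incurs the same memory cost."""
    return num_patients * mem_qid_kb


def reduction_percent(baseline, proposed):
    """Relative reduction of proposed with respect to baseline, in percent."""
    return (baseline - proposed) / baseline * 100


# Per-patient memory in KB for the 500-patient run, taken from the text.
mem_kb = {"ETQI-FD": 2, "[1]": 3, "[2]": 4}
overhead = {name: communication_overhead(500, kb) for name, kb in mem_kb.items()}

print(overhead)  # {'ETQI-FD': 1000, '[1]': 1500, '[2]': 2000}
print(round(reduction_percent(overhead["[1]"], overhead["ETQI-FD"]), 1))  # 33.3
print(round(reduction_percent(overhead["[2]"], overhead["ETQI-FD"]), 1))  # 50.0
```

At this single data point the reductions are about 33% and 50%, close to the reported averages of 32% and 49% over all runs.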
Figure 5. Comparison of ETQI-FD, attribute centric anonymization [1], robust anonymization and risk
assessment [2] with respect to communication overhead


4.2. Analysis of accuracy

Privacy preservation accuracy in our work refers to the accuracy maintained during quasi identifier identification and the privacy preserved in big healthcare data. This is mathematically expressed as in (13).

$$A = \sum_{i=1}^{n} \frac{P_{AP}}{P_i} \times 100 \tag{13}$$

From (13), the accuracy $A$ is measured as the percentage ratio of the patient data with accurate identification of quasi identifiers and privacy preservation $P_{AP}$ to the number of patients involved in the simulation process $P_i$. It is measured in percentage (%). The comparison of the accuracy for the proposed ETQI-FD method and the existing attribute centric anonymization [1] and robust anonymization and risk assessment [2] is graphically depicted in Figure 6.
Figure 6. Comparison of ETQI-FD, attribute centric anonymization [1], robust anonymization and risk
assessment [2] with respect to accuracy


Following the communication overhead results, the accuracy on the diabetes 130-US hospitals dataset is depicted in Figure 6. The experiments to estimate accuracy were conducted with 500 to 5000 patients. Consider 500 patients' records taken from the dataset. With ETQI-FD, 487 patients' records are correctly recognized, so the accuracy from (13) is about 97.4%, whereas 471 and 461 patients' records are correctly detected using [1] and [2], giving accuracies of about 94.2% and 92.2% respectively. This is owing to applying the information loss function separately for numerical and categorical attributes via generalization and suppression. By applying this function separately, a separate evolutionary tree is constructed for each feature, after which the quasi identifiers are identified to preserve privacy and improve accuracy. The average comparison results demonstrate that the accuracy of the proposed ETQI-FD is improved by 11% and 14% during privacy preservation when compared to existing [1] and [2] respectively.
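The accuracy ratio can be checked in the same way. The sketch below rests on an assumption that should be verified against the original experiments: the counts 487, 471, and 461 are treated as correctly handled records out of 500, which is the sample size under which the counts and the quoted percentages roughly agree. The function name and the `TOTAL` constant are illustrative.

```python
def accuracy_percent(correct, total):
    """Share of correctly handled patient records, in percent."""
    return correct / total * 100


TOTAL = 500  # assumed sample size for the quoted counts
correct_counts = {"ETQI-FD": 487, "[1]": 471, "[2]": 461}

for name, correct in correct_counts.items():
    print(name, round(accuracy_percent(correct, TOTAL), 1))
```

Under this assumption the three methods reach roughly 97.4%, 94.2%, and 92.2%.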


4.3. Analysis of information loss 
The information loss refers to the amount of loss incurred during privacy preservation. This is mathematically estimated as in (14).

$$IL = \sum_{i=1}^{n} \frac{P_L}{P_i} \times 100 \tag{14}$$

From (14), the information loss $IL$ is measured on the basis of the patients involved in the simulation during privacy preservation $P_i$ and the amount of patient data lost $P_L$. It is measured in percentage (%). The information loss of ETQI-FD is compared with that of the existing attribute centric anonymization [1] and robust anonymization and risk assessment [2] in Figure 7.

Figure 7 illustrates the variation in information loss for different numbers of patients obtained at different time intervals. For a simulation involving 500 patients, the amount of patient data lost was 12 using ETQI-FD, 17 using [1], and 30 using [2], so the overall information loss was 2.4%, 3.4%, and 6% respectively. The minimization of information loss using ETQI-FD is due to the application of the federated adaptive Lorentz privacy preservation algorithm. Finally, the average of ten results indicates that the information loss is considerably reduced by 32% and 60% when compared to existing [1] and [2] respectively.
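The information loss figures for the 500-patient run follow from the same kind of ratio. The short sketch below uses the lost-record counts quoted in the text; the function name is chosen for this illustration.

```python
def information_loss_percent(lost, total):
    """Share of patient records lost during privacy preservation, in percent."""
    return lost / total * 100


lost_counts = {"ETQI-FD": 12, "[1]": 17, "[2]": 30}  # out of 500 patients
losses = {name: information_loss_percent(lost, 500)
          for name, lost in lost_counts.items()}

for name, loss in losses.items():
    print(name, round(loss, 1))
```

The printed values match the 2.4%, 3.4%, and 6% stated for this run.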
Figure 7. Comparison of ETQI-FD, attribute centric anonymization [1], robust anonymization and risk
assessment [2] with respect to information loss


5. CONCLUSION 

A machine learning privacy preservation method has been proposed for big healthcare data under a high level of anonymization. An evolutionary tree-based indexed quasi identification model is introduced to map between sample sets and attribute values separately for numerical and categorical attributes. The method also integrates the privacy preservation model with federated learning, injecting noise based on a threshold value. It takes advantage of the existing features and generates only a few attributes as quasi identifiers. The proposed method is compared with two existing methods (attribute centric anonymization, and robust anonymization and risk assessment). The proposed ETQI-FD method achieves better privacy and accuracy. As future work, the proposed method can be extended with cryptographic techniques for privacy preservation.